Abstract



High-energy physics experiments are currently record- 
ing large amounts of data and in a few years will be record- 
ing prodigious quantities of data. New methods must be 
developed to handle this data and make analysis at universi- 
ties possible. Grid Computing is one method; however, the 
data must be cached at the various Grid nodes. We examine 
some storage techniques that exploit recent developments 
in commodity hardware. Disk arrays using RAID level 5 
(RAID-5) include both parity and striping. The striping im- 
proves access speed. The parity protects data in the event 
of a single disk failure, but not in the case of multiple disk 
failures. 

We report on tests of dual-processor Linux Software 
RAID-5 arrays and Hardware RAID-5 arrays using a 12- 
disk 3ware controller, in conjunction with 250 and 300 GB 
disks, for use in offline high-energy physics data analy- 
sis. The price of IDE disks is now less than $1/GB. These 
RAID-5 disk arrays can be scaled to sizes affordable to 
small institutions and used when fast random access at low 
cost is important. 

INTRODUCTION 

We have tested redundant arrays of integrated drive elec- 
tronics (IDE) disk drives, using the Linux operating sys- 
tem, for use in particle physics Monte Carlo simulations 
and data analysis. Parts costs of total systems using com- 
modity IDE disks are now at the $2000 per terabyte level. 
A revolution is in the making. Our tests cover Software and
Hardware RAID-5 (redundant array of inexpensive disks,
level 5) systems running under Linux.
RAID-5 protects data in case of a catastrophic single disk 
failure by providing parity bits. Journaling file systems 
are used to allow rapid recovery (minutes rather than days) 
from system crashes and power failures. 

Our data analysis strategy is to encapsulate data and CPU 
processing power together. Data is stored on many PCs. 
Analysis of a particular part of a data set takes place lo- 
cally on, or close to, the PC where the data resides. The 
network backbone is only used to put results together. If 
the I/O overhead is moderate and analysis tasks need more 
than one local CPU to plow through data, then each of 
these disk arrays could be used as a local file server to a 
few computers sharing a local network switch. These com- 
modity network switches would be combined with a single 
high end, fast backplane switch allowing the connection of 
a thousand PCs. To this end, we have also successfully
measured the file transfer speed of Network File System
(NFS) software over a local Gigabit network.

RAID [1] stands for Redundant Array of Inexpensive Disks.
Many industry offerings meet all of the qualifications
except the inexpensive part, severely limiting the size of
an array for a given budget. This is now changing. The
different RAID levels can be defined as follows:

• RAID-0: "Striped." Disks are combined into one 
physical device where reads and writes of data are 
done in parallel. Access speed is fast but there is no 
redundancy. 

• RAID-1: "Mirrored." Fully redundant, but the size is 
limited to the smallest disk. 

• RAID-4: "Parity." For N disks, one disk is dedicated to
parity and the remaining N-1 disks are combined.
Protects against a single disk failure, but writes are
slow because the parity disk must be updated on every
write. Some, but not all, files may be recoverable
if two disks fail.

• RAID-5: "Striped-Parity." As with RAID-4, the effective
size is that of N-1 disks. However, since the parity
information is distributed evenly among the N drives,
the bottleneck of having to update a single parity disk
for each write is avoided. Protects against a single
disk failure and the access speed is fast.
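
The parity protection described for RAID-4 and RAID-5 can be
illustrated with a minimal sketch. It assumes byte-wise XOR
parity and a simple rotation of the parity position (stripe
index modulo the number of disks); the function names and the
rotation rule are illustrative only and are not the layout
used by any particular controller or by the Linux driver.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks, stripe_index):
    """Lay out one stripe across len(data_blocks) + 1 disks.

    The parity block is the XOR of the data blocks. Its position
    rotates with the stripe index (an assumed, illustrative rule),
    so no single disk holds all of the parity.
    """
    n_disks = len(data_blocks) + 1
    parity_disk = stripe_index % n_disks
    stripe = list(data_blocks)
    stripe.insert(parity_disk, xor_blocks(data_blocks))
    return stripe, parity_disk


def rebuild(stripe, failed_disk):
    """Recompute the block on one failed disk from the survivors."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_disk]
    return xor_blocks(survivors)


if __name__ == "__main__":
    data = [b"BaBa", b"r da", b"ta!!"]  # three data blocks, four disks
    stripe, parity_disk = make_stripe(data, stripe_index=2)
    lost = 0  # pretend disk 0 stopped working
    print("parity on disk", parity_disk)
    print("rebuilt block:", rebuild(stripe, lost))
```

Because the XOR of all blocks in a stripe is zero, any single
missing block equals the XOR of the others. With two blocks
missing the equation has two unknowns, which is why a second
disk failure is not covered.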

Hardware and Software RAID-5, using enhanced integrated
drive electronics (EIDE) disks under Linux, are now
available [2]. Redundant disk arrays do provide protection
in the most likely failure case, that in which a single
disk simply stops working. This removes a major obstacle to
building large arrays of EIDE disks. However, RAID-5 does
not totally protect against other types of disk failures.
It offers only limited protection when a failing disk also
takes down the whole EIDE bus (or the whole EIDE controller
card), even if only temporarily. Such a failure would
temporarily disable the whole RAID-5 array. If replacing
the bad disk solves the problem, i.e. the failure did not
permanently damage data on other disks, then the RAID-5
array would recover normally.

TEST SETUP 

To get a large RAID array one needs to use large capacity
disk drives. There have been some problems with using
large disks, primarily the maximum addressable size. We
have discussed these problems in earlier papers [9] [10].
Using arrays of disk drives, such as those shown in
Table 1, one can create multi-terabyte RAID arrays.

Table 1: Comparison of Large EIDE Disks for a RAID-5 Array

Disk Model                     Size (GB)  RPM   Cost/GB  GB/platter  Cache  Warranty
Maxtor MaXLine II [3]          300        5400  $0.75    75          2 MB   3 year
Western Digital WD2500JB [4]   250        7200  $0.61    80          8 MB   3 year
Maxtor MaXLine Plus II         250        7200  $0.69    80          8 MB   3 year
Maxtor MaXLine Plus III SATA   300        7200  $0.77    100         8 MB   3 year
Seagate Barracuda SATA         200        7200  $0.63    100         8 MB   5 year
Hitachi DeskStar 7K400         400        7200  $1.10    80          8 MB   3 year

To test both the Software and Hardware RAID-5 arrays we
started with a base system and built two different
configurations on it. The base system consisted of the
following: a 120 GB Western Digital system disk, an MSI K7D
Master MPX motherboard, dual 2 GHz AMD Athlon CPUs, 1024 MB
of DDR memory, and a Gigabit Ethernet card. We also needed
a second power supply (15 A at 12 V) to supply enough power
at startup for all the disks; the startup power needed was
about 450 W. We also needed 24 inch EIDE cables and, due to
space and air-flow considerations, we used round EIDE
cables rather than ribbon cables. (Base cost: $1400)

For Software RAID-5 we used either eight 250 GB Western
Digital disks [4] or eight 300 GB Maxtor disks [3] and
two Promise Ultra133 PCI (Peripheral Component
Interconnect) EIDE controller cards [11]. We originally
planned to use three Promise cards but found that there
were too many PCI interrupt request conflicts. This limited
us to 8 disks, which was convenient since we would
otherwise have run into the 2 Terabyte "disk" size limit of
Linux kernel 2.4 at this point. (Cost: Base plus $2300 for
a total of $3700 or about $1850/TB.)
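
The cost figure above is simple arithmetic, sketched here for
the eight-disk, 250 GB configuration. The text does not say
whether raw or usable capacity is the divisor; dividing by the
raw capacity (8 x 250 GB) reproduces the quoted $1850/TB, and
the sketch also shows the figure for the N-1 usable disks of a
RAID-5 array. The function name and the decimal definition of
a terabyte (1000 GB) are assumptions made for illustration.

```python
def cost_per_tb(total_cost_usd, n_disks, disk_gb, parity_disks=0):
    """Cost per terabyte (1 TB taken as 1000 GB, an assumption).

    parity_disks=0 divides by raw capacity; parity_disks=1 divides
    by the usable capacity of a single RAID-5 array.
    """
    capacity_tb = (n_disks - parity_disks) * disk_gb / 1000.0
    return total_cost_usd / capacity_tb


BASE_COST_USD = 1400       # base system, from the text
SOFTWARE_EXTRA_USD = 2300  # disks and controller cards, from the text

total = BASE_COST_USD + SOFTWARE_EXTRA_USD
raw = cost_per_tb(total, n_disks=8, disk_gb=250)
usable = cost_per_tb(total, n_disks=8, disk_gb=250, parity_disks=1)

if __name__ == "__main__":
    print(f"total cost:         ${total}")
    print(f"per raw TB:         ${raw:.0f}")     # 3700 / 2.00
    print(f"per usable RAID-5 TB: ${usable:.0f}")  # 3700 / 1.75
```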

For Hardware RAID-5 we used twelve 250 GB Western Digital
disks and the 3ware 12-disk RAID controller 7506-12 [12].
This controller also had a 2 Terabyte "disk" size limit. To
overcome this limit we had to split the 12 disks into two
Hardware RAID-5 arrays of six disks, each forming a 1.2 TB
RAID-5 array. We then combined them, using Software RAID-0,
into a 2.4 TB array. We then upgraded to the Linux 2.6
kernel, which does support arrays larger than 2 Terabytes
(2^32 blocks of 512 bytes) when used with the newer 3ware
Escalade 9500S-12 controller, but not with our 7506-12.
(Cost: Base plus $3800 for a total of $5200 or about
$1890/TB.)
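
The 2 Terabyte limit follows from addressing 512-byte blocks
with a 32-bit block number. The sketch below works through
that arithmetic and checks which RAID-5 array sizes fit under
it. It assumes that a "250 GB" disk holds 250 x 10^9 bytes
(the decimal convention used by disk vendors); the function
names are illustrative.

```python
BLOCK_BYTES = 512      # block size, from the text
MAX_BLOCKS = 2 ** 32   # 32-bit block numbers, from the text
DISK_BYTES = 250 * 10 ** 9  # assumed decimal size of a 250 GB disk


def max_device_bytes(block_bytes=BLOCK_BYTES, max_blocks=MAX_BLOCKS):
    """Largest device addressable with the given block numbering."""
    return block_bytes * max_blocks


def raid5_usable_bytes(n_disks, disk_bytes=DISK_BYTES):
    """Usable size of one RAID-5 array: N-1 disks of data."""
    return (n_disks - 1) * disk_bytes


def fits_under_limit(n_disks, disk_bytes=DISK_BYTES):
    """True if a single RAID-5 array stays within the block limit."""
    return raid5_usable_bytes(n_disks, disk_bytes) <= max_device_bytes()


if __name__ == "__main__":
    print("limit in bytes:", max_device_bytes())
    for n in (6, 8, 9, 10, 12):
        size_tb = raid5_usable_bytes(n) / 1e12
        print(f"{n:2d} disks: {size_tb:.2f} TB usable, fits: {fits_under_limit(n)}")
```

Under these assumptions a single array of twelve disks exceeds
the limit, nine disks is the largest array that fits, and each
six-disk array is well below it.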

RESULTS 

After confirming that the CPU swapping algorithms do allow
for efficient use of dual-CPU computers (kswap was
typically 10-15% for CPU intensive jobs), we tested the
array write speeds with a simple program that wrote
3.28 x 10^9 bytes of plain text ("All work and no play
make Jack a dull boy"). We had the following results:
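
A write test of this kind can be sketched as follows. This is
not the program used for the measurements reported here; it is
a minimal illustration that writes the same line of text
repeatedly, forces the data to disk, and reports the rate. The
function name, the chunk size, and the use of 10^6 bytes per
MB are assumptions.

```python
import os
import tempfile
import time

LINE = b"All work and no play make Jack a dull boy\n"


def timed_write(path, total_bytes, chunk_lines=4096):
    """Write total_bytes of repeated text to path.

    Returns (bytes_written, rate in MB/s with 1 MB = 10**6 bytes).
    """
    chunk = LINE * chunk_lines
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            piece = chunk[: total_bytes - written]
            f.write(piece)
            written += len(piece)
        f.flush()
        os.fsync(f.fileno())  # include the time to reach the disk
    elapsed = time.perf_counter() - start
    return written, written / elapsed / 1e6


if __name__ == "__main__":
    # Small demonstration size; the test in the text wrote 3.28e9 bytes.
    with tempfile.TemporaryDirectory() as tmp:
        n, rate = timed_write(os.path.join(tmp, "jack.txt"), 10 * 10 ** 6)
        print(f"wrote {n} bytes at {rate:.1f} MB/s")
```

To test an array, the path would point at a file on the
mounted array, and the file should be much larger than main
memory so that caching does not dominate the result.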



Software RAID-5

For the Software RAID-5 we only used 8 of the 250 GB disks,
thus we were under the 2 Terabyte "disk" size limit of
Linux kernel 2.4. Using the test program described above we
had a base write speed (with only that job running) of
29 MB/s. For two concurrent writes (2 instances of the same
job writing to 2 different files) we had a speed of
24 MB/s, an overhead of 17%. For a read test we copied the
file to the system disk at a rate of 37 MB/s. When simply
copying the file back to the RAID-5 array we had a write
speed of 33 MB/s. The CPU overhead of journaling, Software
RAID, and writing the file was about 10-15%, but for a
single-instance write this was running on the other CPU.
Therefore, for fast CPUs, the overhead of Software RAID-5
is negligible.



Hardware RAID-5

Because of the 2 Terabyte "disk" size limit of Linux kernel
2.4, we first tried using nine 250 GB disks (out of the 12
possible disks) and Hardware RAID-5, forming a 2 TB array.
The base write speed, as described above, was 41 MB/s. We
then used Linux kernel 2.6 so we could have a larger RAID
array; however, we discovered that the Hardware RAID
controller also had a 2 TB "disk" size limit. Therefore we
created 2 Hardware RAID-5 arrays of six disks, each forming
a 1.2 TB RAID-5 array. We then combined them using Software
RAID-0 into a 2.4 TB array. We now had a base write speed
of 29 MB/s. For two concurrent writes we had a speed of
25 MB/s, an overhead of 14%. We then turned off the RAID-0
and performed some additional speed tests. Again, when
simply copying the file back to the RAID-5 array we had a
write speed of 33 MB/s. When we copied the file from one
Hardware RAID-5 array to the other the speed was 37 MB/s.
The CPU overhead of journaling and writing the file was
about 1-5%, making the Hardware RAID-5 array more
efficient; however, if the array is used only as a disk
server, then even a single CPU would suffice.



Gigabit Network 

To test the practical speed of a local Gigabit network we
connected the Software RAID-5 array and another Linux
computer together using an inexpensive 8-port commodity
Gigabit switch [13]. We first mounted the array via NFS.
When mounted using synchronous NFS we had a write speed of
13 MB/s and when mounted using asynchronous NFS we had a
write speed of 18 MB/s. When simply copying the file to the
RAID-5 array over NFS we had a write speed of 23 MB/s.
These speeds should be compared with the base internal
write speed of 29 MB/s and the 33 MB/s for copying. The
combined network and NFS overhead is 38% for asynchronous
writing and 30% for copying. When we tried simple FTP we
had a speed of 22 MB/s, a network overhead of 33%.
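
The overhead percentages quoted in this section are the
fractional loss of speed relative to the corresponding
local measurement. A minimal sketch of that arithmetic, using
the speeds reported above (the function name and the pairing
of FTP with the 33 MB/s local copy speed are our reading of
the text):

```python
def overhead_percent(baseline_mb_s, measured_mb_s):
    """Speed lost relative to the baseline, in percent."""
    return 100.0 * (baseline_mb_s - measured_mb_s) / baseline_mb_s


# (description, baseline MB/s, measured MB/s), speeds from the text
MEASUREMENTS = [
    ("Software RAID-5, two concurrent writes", 29, 24),
    ("Hardware RAID-5, two concurrent writes", 29, 25),
    ("asynchronous NFS write", 29, 18),
    ("copy over NFS", 33, 23),
    ("FTP", 33, 22),
]

if __name__ == "__main__":
    for label, baseline, measured in MEASUREMENTS:
        pct = overhead_percent(baseline, measured)
        print(f"{label}: {pct:.0f}%")
```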

Motherboard Performance 

One should note that in previous tests [10] we did see
much higher writing speeds. For those tests we used a
different motherboard [14] (we were only testing single-CPU
arrays) that was noted for efficiently bridging the PCI bus.

FUTURE RECOMMENDATIONS 

Over the last 3 years we have put together 5 RAID-5 arrays
of various sizes [9] [10]. The arrays have been used at
SLAC on BaBar and at CERN for CMS Monte Carlo [15]. These
arrays totaled 40 EIDE disks. Over 25% failed within 3
years, fortunately within the warranty period. Some of this
rate may be attributed to power failures, or perhaps a bad
batch, but it still seems to be too high a rate. Given this
failure rate and other considerations we make the following
recommendations. If you plan to build it yourself you
should use hot-swappable Serial Advanced Technology
Attachment (SATA) drives with at least a 3-year warranty.
The connectors for SATA drives also take up far less space,
and SATA drives can safely operate over longer cable
lengths than standard EIDE drives. This is important if you
use either a tall "Tower" case or a double-wide "Tower"
case. The double-wide "Tower" case has the advantage of
better airflow for cooling. We have used Supermicro
CSE-733T-450B cases to build boxes with hot-swappable SATA
disks in non-RAID-5 systems.
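
To give a feel for what the failure rate quoted above means
for a single array, the following sketch converts "25% within
3 years" into an annual rate and an expected number of disk
replacements per year. It assumes independent failures at a
constant rate, which the text does not establish (power
failures or a bad batch would violate it), and since more than
25% of the disks failed the results are lower bounds. The
function names are illustrative.

```python
def annualized_failure_rate(fraction_failed, years):
    """Annual failure probability that compounds to fraction_failed.

    Assumes a constant, independent per-disk failure rate.
    """
    return 1.0 - (1.0 - fraction_failed) ** (1.0 / years)


def expected_failures_per_year(n_disks, annual_rate):
    """Expected number of disk replacements per year for an array."""
    return n_disks * annual_rate


if __name__ == "__main__":
    rate = annualized_failure_rate(0.25, 3)
    print(f"annual failure rate: {100 * rate:.1f}%")
    for n in (8, 12):
        print(f"{n} disks: {expected_failures_per_year(n, rate):.2f} per year")
```

Under these assumptions a 12-disk array would need roughly one
disk replaced per year, which is part of the case for
hot-swappable drives.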

If you want to use 12 disks you will need a Hardware RAID
controller similar to the 3ware Escalade 9500S-12 (or
9500S-12MI) SATA card. You will want a 2.0 GHz AMD Athlon
(or better) CPU. To connect to your local area network
(LAN) of processing computers you will want at least
Gigabit Ethernet, either a card or on the motherboard. You
might also consider using a Fiber Channel Arbitrated Loop
(FC-AL). FC-AL nominally runs at 100 or 200 MB/s and can be
daisy-chained between computers or connected to Fiber
Channel switches. An FC-AL PCI card typically costs $500
and comes with two ports for connection to a simple loop or
to a fabric switch. A fabric switch typically costs $15000
for 16 ports and allows more simultaneous connections for
increased speed. To exceed the 2 TB "disk" size limit you
will also need to use Linux kernel 2.6 and a journaling
file system such as ext3 [16] or ReiserFS [17]. ReiserFS
created the journal faster than ext3 but it is not
supported by Red Hat Enterprise Linux 3.

There are also various commercial RAID systems that rely on
a hardware RAID controller. Examples of these are shown in
Table 2. They are typically 3U or larger rack mounted
systems. In the past, commercial systems have not been
off-the-shelf commodity items. This is changing, and while
they are anywhere from 1.5 to over eighteen times as
expensive, even allowing for cost of assembly, having an
off-the-shelf unit is quite attractive. The Apple FC-AL
Xserve RAID interface runs with both Macintosh and Linux
computers.



Table 2: Some Commodity Hardware RAID Arrays [18].

System                 Capacity  Size   Price/GB
Apple Xserve RAID      3.5 TB    3U     $3.14
Dell EMC CX200         2.1 TB    3U     $9.05
HP StorageWorks 1000   2.1 TB    3U     $11.39
IBM FASt200 3542-1R    2.1 TB    3U     $24.71
Sun StorEdge 6120      2.04 TB   2x3U   $36.57
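
The price comparison in the text can be checked against
Table 2 with a short calculation. The sketch assumes the
reference price is the $2000 per terabyte (that is, $2 per GB)
parts cost quoted for the self-built arrays; the variable and
function names are illustrative.

```python
SELF_BUILT_USD_PER_GB = 2.00  # $2000/TB parts cost, from the text

# Price per GB of the commercial systems, from Table 2
COMMERCIAL_USD_PER_GB = {
    "Apple Xserve RAID": 3.14,
    "Dell EMC CX200": 9.05,
    "HP StorageWorks 1000": 11.39,
    "IBM FASt200 3542-1R": 24.71,
    "Sun StorEdge 6120": 36.57,
}


def price_ratios(commercial, reference=SELF_BUILT_USD_PER_GB):
    """Ratio of each commercial price to the self-built reference."""
    return {name: price / reference for name, price in commercial.items()}


if __name__ == "__main__":
    for name, ratio in price_ratios(COMMERCIAL_USD_PER_GB).items():
        print(f"{name}: {ratio:.1f} times the self-built cost")
```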



CONCLUSION 

We have tested redundant arrays of IDE disk drives for use
in offline high energy physics data analysis and Monte
Carlo simulations. Parts costs of total systems using
commodity IDE disks are now at the $2000 per terabyte
level. We have tested Software RAID-5 systems running under
Linux 2.4 using Promise Ultra133 disk controllers and
Hardware RAID-5 systems running under Linux 2.4 and 2.6
using a 3ware Hardware RAID controller. We found about 5%
overhead for journaling file systems such as ext3 and
ReiserFS, but given the extra protection and increased
recovery speed we still recommend them. We also found that
Software RAID-5 has about 10% more CPU overhead than
Hardware RAID, but the use of dual-CPU systems, or of the
RAID-5 array as a dedicated file server, makes this "cost"
negligible for any modern CPU. RAID-5 provides parity bits
to protect data in case of a single catastrophic disk
failure. Tape backup is not required for data that can be
recreated with modest effort. Journaling file systems
permit rapid recovery from system crashes and power
failures.

Current high energy physics experiments, for example BaBar
at SLAC, feature relatively low data acquisition rates,
only 3 MB/s, less than a third of the rates taken at
Fermilab fixed target experiments a decade ago [19]. The
Large Hadron Collider experiments CMS and ATLAS, with data
acquisition rates starting at 100 MB/s, will be more
challenging and will require physical architectures that
minimize helter-skelter data movement if they are to
fulfill their promise. In many cases, architectures
designed to solve particular processing problems are far
more cost effective than general solutions [19] [20].

Grid Computing [21] will entail the movement of large
amounts of data between various sites. RAID-5 arrays will
be needed as disk caches, both during the transfer and when
the data reach their final destination, to ameliorate
Grid-lock. Another example that can apply to Grid Computing
is the Fermilab Mass Storage System, Enstore [22], where
RAID arrays are used as a disk cache for a Tape Silo.
Enstore uses RAID arrays to stage tapes to disk, allowing
faster analysis of large data sets.